The main goal of this notebook is to show how to take an ML model from a basic Jupyter notebook to a production platform. With this preparation you do not need to rely on integration people touching your code before production, which means a clean handover for deployment.
The dataset can be found here
Credits to one of my favourite data scientists - check out Mike's channel
import pandas as pd
import pandas_profiling as pp # %pip3 install pandas-profiling
import matplotlib
import numpy as np
from sklearn.linear_model import LogisticRegression # Used when predicting a categorical class; in our case there are two classes. If you were instead predicting
# a value on a continuous range, such as how warm it is going to be on a given day (e.g. 1-100), you would use
# LinearRegression.
from sklearn.model_selection import cross_val_score # Used for cross validation; gives us a basic score of how well the model ended up being trained.
from sklearn.model_selection import KFold # The number of folds defaults to 5 in cross validation; we import this package to tune it ourselves.
columns = ["sample", "thickness", "size_unif", "shape_unif", "marginal_adhesion", "epithelial_cell_size", "bare_nuclei", "bland_chromatin","normal_nucleoli","mitoses","class" ] #list of column names taken from the accompanying "Names" subfile, since the data file itself has no header row
bc_raw = pd.read_csv("Data/breast-cancer-wisconsin.data", names=columns, na_values=["?"]) # Added na_values=["?"]
bc_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   sample                699 non-null    int64
 1   thickness             699 non-null    int64
 2   size_unif             699 non-null    int64
 3   shape_unif            699 non-null    int64
 4   marginal_adhesion     699 non-null    int64
 5   epithelial_cell_size  699 non-null    int64
 6   bare_nuclei           683 non-null    float64
 7   bland_chromatin       699 non-null    int64
 8   normal_nucleoli       699 non-null    int64
 9   mitoses               699 non-null    int64
 10  class                 699 non-null    int64
dtypes: float64(1), int64(10)
memory usage: 60.1 KB
It shows that all of the columns are integers, except for bare_nuclei; let's look deeper into why. According to the documentation this dataset should contain only integers.
bc_raw.bare_nuclei.unique()
array([ 1., 10., 2., 4., 3., 9., 7., nan, 5., 8., 6.])
The raw column contains "?" characters, so without any special handling pandas would set its dtype to object. Because we passed na_values=["?"], those entries were read as NaN and the column was parsed as float64 instead.
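A minimal sketch of this dtype behaviour, using a tiny in-memory sample (the values and column names here are made up for illustration):

```python
import io
import pandas as pd

# A tiny stand-in for the raw file: "?" marks a missing bare_nuclei value.
raw = io.StringIO("1000025,5,1\n1002945,5,?\n")
cols = ["sample", "thickness", "bare_nuclei"]

# Without na_values, the "?" string forces bare_nuclei to dtype object.
df_obj = pd.read_csv(raw, names=cols)

# With na_values=["?"], pandas parses "?" as NaN and the column becomes float64.
raw.seek(0)
df_num = pd.read_csv(raw, names=cols, na_values=["?"])

print(df_obj["bare_nuclei"].dtype)  # object
print(df_num["bare_nuclei"].dtype)  # float64
```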
We will inspect the data more using pandas profiling.
pp.ProfileReport(bc_raw, progress_bar=False) #This will be embedded in the Jupyter notebook, but there are several ways to save it separately, e.g. as HTML or JSON.
I find this profiling report very useful for basic exploratory data analysis. It saves me a lot of time, so spend some time getting familiar with the data and look for quality issues that could affect the model.
In any case, the report shows:
8 duplicate rows
bare_nuclei, which is non-numeric in the raw file due to the question marks in the dataset.
The class column is unbalanced.
shape and size uniformity are highly correlated.
2 categorical variables, one of which only appears because of the 16 "?" strings in the bare_nuclei column. Those were read in as NaN, and the model cannot later be built on NaN values. We will not use imputing (e.g. filling in the mean or another computed value); we will simply remove those rows.
bc_clean = bc_raw.copy() #copying the raw file
Missing values
bc_clean.dropna(inplace=True) #Drops every row containing a NaN. inplace=True saves memory by performing the operation on the current dataframe (bc_clean) instead of returning a copy.
print(bc_clean.shape)
(683, 11)
Duplicates
bc_clean.drop_duplicates(subset=columns[1:], inplace=True) #drop duplicates based on every column except sample (column 0); this removes rows whose measurements match even under different sample IDs, so far more than the 8 fully duplicated rows are dropped
print(bc_clean.shape)
(449, 11)
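The drop from 683 to 449 rows is much larger than the 8 duplicates the profiling report flagged, because excluding the sample column treats rows as duplicates whenever their measurements match, even for different sample IDs. A small sketch on made-up data (the values and column names are hypothetical):

```python
import pandas as pd

# Two different sample IDs (101, 102) share identical measurements,
# and row 3 is a fully identical repeat of row 0.
df = pd.DataFrame({
    "sample":    [101, 102, 103, 101],
    "thickness": [5,   5,   3,   5],
    "size_unif": [1,   1,   2,   1],
})

full_dupes = df.duplicated().sum()  # counts only fully identical rows
feature_dupes = df.duplicated(subset=["thickness", "size_unif"]).sum()

print(full_dupes)     # 1  (row 3 exactly repeats row 0)
print(feature_dupes)  # 2  (rows 1 and 3 repeat row 0's measurements)
```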
# Train 90% Test 10% Fold#1-5 <-- Cross validation for each repeat
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
# learn from the past to predict the future = predictive skill --> the ability to accurately predict the outcome on data the model has never seen before.
# We will use the data (measurements of a breast cancer tumor) from all the columns (features) to train a model, and use this model to later
# predict the Class, meaning it will become either 2 for benign or 4 for malignant on new measurements for other tumors.
# We haven't really pretuned our model. We will try changing the algorithm (solver); by default it is set to lbfgs.
bc_clean = bc_clean.reset_index(drop=True) #Since we dropped some rows earlier, there are gaps in the index. We therefore re-index starting from 0.
X = bc_clean.loc[:, "thickness":"mitoses"] #Excluding the sample and class columns
y = bc_clean.loc[:, "class"] #The answers for all of the rows in X. The model is built from the X values, and its performance is compared against these actual answers.
model = LogisticRegression(solver="sag", max_iter=2000) #Declare the model: an untrained LogisticRegression instance.
kfold = KFold(n_splits=3, shuffle=True, random_state=100) #Cross-validation folds. n_splits (default 5), shuffle randomizes the rows (instead of taking them top to bottom), and random_state seeds the shuffle so the sequence is reproducible when sharing the ML model.
results = cross_val_score(model, X, y, cv=kfold) #Returns the accuracy of the folds
print("Accuracy: %.1f%% (%.1f%%)" % (results.mean()*100, results.std()*100))
Accuracy: 95.3% (1.6%)
#We see that this accuracy is very high, and the standard deviation across folds is only 1.6% (same units).
print(results)
[0.94444444 0.94444444 0.93333333 0.97777778 0.96629213]
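The profiling report showed that the class column is unbalanced, so one possible refinement (not used above) is StratifiedKFold, which keeps the benign/malignant ratio the same in every fold. A sketch on synthetic labels (the 80/20 split below is made up for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical unbalanced labels: 80 benign (class 2) vs 20 malignant (class 4).
y = np.array([2] * 80 + [4] * 20)
X = np.arange(100).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=100)
for train_idx, test_idx in skf.split(X, y):
    fold = y[test_idx]
    # Each test fold preserves the 80/20 ratio: 16 benign, 4 malignant.
    print((fold == 2).sum(), (fold == 4).sum())
```

This could be passed to cross_val_score via cv=skf exactly like the KFold object above.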
pm_model = LogisticRegression() # the default is good enough here
pm_model.fit(X, y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
import pickle # save the model to a file (see documentation: https://docs.python.org/3/library/pickle.html)
pickle.dump(pm_model, open("pm_model.pkl", "wb")) #saving the file #Write in a binary format
Result
Ready for production
#Load the model as an object
pm_deploy = pickle.load(open("pm_model.pkl", "rb")) #now we have our model loaded into an object #read binary
#Our model will now provide a result with given input.
# labs = pd.read_csv(inputdata.csv) # Assume this as a csv file stored as a dataframe without the result (class) which will be used for input
# result = pm_deploy.predict(labs) # Voila. This could also be served as a web application or exported to CSV.
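A self-contained sketch of the full save/load/predict round trip, using random stand-in data instead of the real dataset (the filename pm_model_demo.pkl, the random training data, and the new_row values are all made up for illustration):

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for the trained model: 9 features, classes 2 and 4,
# fit on random data just to exercise the save/load round trip.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=(50, 9))
y_demo = rng.choice([2, 4], size=50)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Save in binary format, then load it back as a fresh object.
pickle.dump(model, open("pm_model_demo.pkl", "wb"))
deployed = pickle.load(open("pm_model_demo.pkl", "rb"))

# New "lab measurements" (one row, nine features) -> predicted class 2 or 4.
new_row = np.array([[5, 1, 1, 1, 2, 1, 3, 1, 1]])
print(deployed.predict(new_row))
```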